This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Diagrams

Overview

This page covers:

  • Flow diagrams using DiagrammeR
  • Alluvial/Sankey diagrams
  • Dendrogram organizational trees (e.g. of folder contents)

Preparation

Load packages

pacman::p_load(
  DiagrammeR,     # for flow diagrams
  networkD3       # For alluvial/Sankey diagrams
  )

Flow diagrams

One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.

Tools

The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.

Basic structure

  1. Open the instructions grViz("
  2. Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {
  3. Graph statement (layout, rank direction)
  4. Nodes statements (create nodes)
  5. Edges statements (gives links between nodes)
  6. Close the instructions }")

Simple example

Below are two simple examples

A very minimal example:

# A minimal plot
DiagrammeR::grViz("digraph {
  
graph[layout = dot, rankdir = LR]

a
b
c

a -> b -> c
}")

An example with applied public health context:

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,
         overlap = true,
         fontsize = 10]
  
  # nodes
  #######
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]               # width of circles
  
  Primary                         # names of nodes
  Secondary
  Tertiary

  # edges
  #######
  Primary   -> Secondary [label = 'case transfer']
  Secondary -> Tertiary [label = 'case transfer']
}
")

Syntax

Basic syntax

Node names, or edge statements, can be separated with spaces, semicolons, or newlines.

Rank direction

A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.

Node names

Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (’ ’). It may be easier to have a short node name, and assign a label, as shown below within brackets [ ]. A label is also necessary to have a newline within the node name - use \n in the node label within single quotes, as shown below.

Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.

Layouts

  • dot (set rankdir to either TB, LR, RL, BT, )
  • neato
  • twopi
  • circo

Nodes - editable attributes

  • label (text, in single quotes if multi-word)
  • fillcolor (many possible colors)
  • fontcolor
  • alpha (transparency 0-1)
  • shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
  • style
  • sides
  • peripheries
  • fixedsize (h x w)
  • height
  • width
  • distortion
  • penwidth (width of shape border)
  • x (displacement left/right)
  • y (displacement up/down)
  • fontname
  • fontsize
  • icon

Edges - editable attributes

  • arrowsize
  • arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
  • arrowtail
  • dir (direction, )
  • style (dashed, …)
  • color
  • alpha
  • headport (text in front of arrowhead)
  • tailport (text in behind arrowtail)
  • fontname
  • fontsize
  • fontcolor
  • penwidth (width of arrow)
  • minlen (minimum length)

Color names: hexadecimal values or ‘X11’ color names, see here for X11 details

Complex examples

The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            # layout top-to-bottom
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      
  
  Primary   [label = 'Primary\nFacility'] 
  Secondary [label = 'Secondary\nFacility'] 
  Tertiary  [label = 'Tertiary\nFacility'] 
  SC        [label = 'Surveillance\nCoordination',
             fontcolor = darkgreen] 
  
  # edges
  #######
  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  
  # grouped edge
  {Primary Secondary Tertiary} -> SC [label = 'case reporting',
                                      fontcolor = darkgreen,
                                      color = darkgreen,
                                      style = dashed]
}
")

Sub-graph clusters

To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have the subgraph identified within a box, begin the name with “cluster” as shown below.

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            
         overlap = true,
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,                  # shape = circle
       fixedsize = true
       width = 1.3]                      # width of circles
  
  subgraph cluster_passive {
    Primary   [label = 'Primary\nFacility'] 
    Secondary [label = 'Secondary\nFacility'] 
    Tertiary  [label = 'Tertiary\nFacility'] 
    SC        [label = 'Surveillance\nCoordination',
               fontcolor = darkgreen] 
  }
  
  # nodes (boxes)
  ###############
  node [shape = box,                     # node shape
        fontname = Helvetica]            # text font in node
  
  subgraph cluster_active {
    Active [label = 'Active\nSurveillance']; 
    HCF_active [label = 'HCF\nActive Search']
  }
  
  subgraph cluster_EBD {
    EBS [label = 'Event-Based\nSurveillance (EBS)']; 
    'Social Media'
    Radio
  }
  
  subgraph cluster_CBS {
    CBS [label = 'Community-Based\nSurveillance (CBS)'];
    RECOs
  }

  
  # edges
  #######
  {Primary Secondary Tertiary} -> SC [label = 'case reporting']

  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red]
  
  HCF_active -> Active
  
  {'Social Media'; Radio} -> EBS
  
  RECOs -> CBS
}
")

node shapes

The example below, borrowed from this tutorial, shows applied node shapes, and shows a shorthand for serial edge connections

DiagrammeR::grViz("digraph {

graph [layout = dot, rankdir = LR]

# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]

data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label =  'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']

# edge definitions with the node IDs
{data1 data2}  -> process -> statistical -> results
}")

Handling outputs

  • Outputs will appear in RStudio’s Viewer pane, by default in the lower-right alongside Files, Plots, Packages, and Help.
  • To export you can “Save as image” or “Copy to clipboard” from the Viewer. The graphic will adjust to the specified size.

Parameters in figures

“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this we, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is the a unique numeric index. Here is a basic example:”
https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/

# Define some sample data
data <- list(a=1000, b=800, c=600, d=400)


DiagrammeR::grViz("
digraph graph2 {

graph [layout = dot]

# node definitions with substituted label text
node [shape = rectangle, width = 4, fillcolor = Biege]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']

a -> b -> c -> d

}

[1]:  paste0('Raw Data (n = ', data$a, ')')
[2]: paste0('Remove Errors (n = ', data$b, ')')
[3]: paste0('Identify Potential Customers (n = ', data$c, ')')
[4]: paste0('Select Top Priorities (n = ', data$d, ')')
")

Much of the above is adapted from the tutorial at this site

Other more in-depth tutorial: http://rich-iannone.github.io/DiagrammeR/

Alluvial/Sankey Diagrams

Preparation

Load packages

pacman::p_load(networkD3)

Plotting from dataset

Plotting the connections in a dataset

https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html

Counts of age category and hospital, relabled as target and source, respectively.

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>% 
  rename(source = hospital,
         target = age_cat)

Now formalize the nodes list, and adjust the ID columns to be numbers instead of labels:

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

Now plot the Sankey diagram:


# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "TWh",
                   fontSize = 12,
                   nodeWidth = 30)
p

Here is an example where the patient Outome is included as well. Note in the data management step how we bind rows of counts of hospital -> outcome, using the same column names.

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  mutate(age_cat = stringr::str_glue("Age {age_cat}")) %>% 
  count(hospital, age_cat) %>% 
  rename(source = age_cat,
         target = hospital) %>% 
  bind_rows(
    linelist %>% 
      select(hospital, outcome) %>% 
      count(hospital, outcome) %>% 
      rename(source = hospital,
             target = outcome)
  )

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "TWh",
                   fontSize = 12,
                   nodeWidth = 30)
p

https://www.displayr.com/sankey-diagrams-r/

Timeline Sankey - LTFU from cohort… application/rejections… etc.

Event timelines

E.g. border closures during COVID

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.